This is a temporary project that exists only to complete mandatory classwork for the Open Data Science class at the University of Helsinki. I expect to hide or remove this repository after December 2020.


Introduction to Open Data Science - Course Project

Chapter 1: About the project

This repository is for an introductory open data science class for PhD students at the University of Helsinki (with some students from other universities). The course is open to everyone, uses open source tools, and emphasizes open science. The statistics part seems fairly elementary, but I could use a brush-up, and that is why I signed up for the course.

The course seems to cover GitHub, R, RStudio, linear and logistic regression, dimensionality reduction, clustering, hypothesis testing, and some other basic multivariate statistical analysis tools.

Links to my repositories and web pages for the class:


Chapter 2: Regression and model validation

Data wrangling exercise

Here is code and output from data/create_learning2014.R:

# Tatu Ylönen / University of Helsinki
# 1.11.2020
# RStudio Exercise 2 - data wrangling

# This uses the data set described in:
# Kimmo Vehkalahti: ASSIST 2014 - Phase 3 (end of Part 2), N=183
# Course: Johdatus yhteiskuntatilastotieteeseen, syksy 2014
# (Introduction to Social Statistics, fall 2014 - in Finnish),
# international survey of Approaches to Learning, made possible
# by Teachers' Academy funding for KV in 2013-2015.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the data into a data frame.
data <- read.csv(file="http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t")

# Print the dimensions of the data.
print("=== dimensions of the JYTOPKYS3-data dataset")
## [1] "=== dimensions of the JYTOPKYS3-data dataset"
dim(data)
## [1] 183  60
# Print a summary of what the data frame contains.
# Turns out there are 183 observations for 60 variables.
print("Structure of the JYTOPKYS3-data dataset")
## [1] "Structure of the JYTOPKYS3-data dataset"
str(data)
## 'data.frame':    183 obs. of  60 variables:
##  $ Aa      : int  3 2 4 4 3 4 4 3 2 3 ...
##  $ Ab      : int  1 2 1 2 2 2 1 1 1 2 ...
##  $ Ac      : int  2 2 1 3 2 1 2 2 2 1 ...
##  $ Ad      : int  1 2 1 2 1 1 2 1 1 1 ...
##  $ Ae      : int  1 1 1 1 2 1 1 1 1 1 ...
##  $ Af      : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ ST01    : int  4 4 3 3 4 4 5 4 4 4 ...
##  $ SU02    : int  2 2 1 3 2 3 2 2 1 2 ...
##  $ D03     : int  4 4 4 4 5 5 4 4 5 4 ...
##  $ ST04    : int  4 4 4 4 3 4 2 5 5 4 ...
##  $ SU05    : int  2 4 2 3 4 3 2 4 2 4 ...
##  $ D06     : int  4 2 3 4 4 5 3 3 4 4 ...
##  $ D07     : int  4 3 4 4 4 5 4 4 5 4 ...
##  $ SU08    : int  3 4 1 2 3 4 4 2 4 2 ...
##  $ ST09    : int  3 4 3 3 4 4 2 4 4 4 ...
##  $ SU10    : int  2 1 1 1 2 1 1 2 1 2 ...
##  $ D11     : int  3 4 4 3 4 5 5 3 4 4 ...
##  $ ST12    : int  3 1 4 3 2 3 2 4 4 4 ...
##  $ SU13    : int  3 3 2 2 3 1 1 2 1 2 ...
##  $ D14     : int  4 2 4 4 4 5 5 4 4 4 ...
##  $ D15     : int  3 3 2 3 3 4 2 2 3 4 ...
##  $ SU16    : int  2 4 3 2 3 2 3 3 4 4 ...
##  $ ST17    : int  3 4 3 3 4 3 4 3 4 4 ...
##  $ SU18    : int  2 2 1 1 1 2 1 2 1 2 ...
##  $ D19     : int  4 3 4 3 4 4 4 4 5 4 ...
##  $ ST20    : int  2 1 3 3 3 3 1 4 4 2 ...
##  $ SU21    : int  3 2 2 3 2 4 1 3 2 4 ...
##  $ D22     : int  3 2 4 3 3 5 4 2 4 4 ...
##  $ D23     : int  2 3 3 3 3 4 3 2 4 4 ...
##  $ SU24    : int  2 4 3 2 4 2 2 4 2 4 ...
##  $ ST25    : int  4 2 4 3 4 4 1 4 4 4 ...
##  $ SU26    : int  4 4 4 2 3 2 1 4 4 4 ...
##  $ D27     : int  4 2 3 3 3 5 4 4 5 4 ...
##  $ ST28    : int  4 2 5 3 5 4 1 4 5 2 ...
##  $ SU29    : int  3 3 2 3 3 2 1 2 1 2 ...
##  $ D30     : int  4 3 4 4 3 5 4 3 4 4 ...
##  $ D31     : int  4 4 3 4 4 5 4 4 5 4 ...
##  $ SU32    : int  3 5 5 3 4 3 4 4 3 4 ...
##  $ Ca      : int  2 4 3 3 2 3 4 2 3 2 ...
##  $ Cb      : int  4 4 5 4 4 5 5 4 5 4 ...
##  $ Cc      : int  3 4 4 4 4 4 4 4 4 4 ...
##  $ Cd      : int  4 5 4 4 3 4 4 5 5 5 ...
##  $ Ce      : int  3 5 3 3 3 3 4 3 3 4 ...
##  $ Cf      : int  2 3 4 4 3 4 5 3 3 4 ...
##  $ Cg      : int  3 2 4 4 4 5 5 3 5 4 ...
##  $ Ch      : int  4 4 2 3 4 4 3 3 5 4 ...
##  $ Da      : int  3 4 1 2 3 3 2 2 4 1 ...
##  $ Db      : int  4 3 4 4 4 5 4 4 2 4 ...
##  $ Dc      : int  4 3 4 5 4 4 4 4 4 4 ...
##  $ Dd      : int  5 4 1 2 4 4 5 3 5 2 ...
##  $ De      : int  4 3 4 5 4 4 5 4 4 2 ...
##  $ Df      : int  2 2 1 1 2 3 1 1 4 1 ...
##  $ Dg      : int  4 3 3 5 5 4 4 4 5 1 ...
##  $ Dh      : int  3 3 1 4 5 3 4 1 4 1 ...
##  $ Di      : int  4 2 1 2 3 3 2 1 4 1 ...
##  $ Dj      : int  4 4 5 5 3 5 4 5 2 4 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
# Create the analysis dataset with variables gender, age, attitude,
# deep, stra, surf and points.  First copy the data wrangling statements from
# the datacamp exercises.
data$attitude <- data$Attitude / 10
deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30",
                    "D06",  "D15", "D23", "D31")
surface_questions <- c("SU02","SU10","SU18","SU26", "SU05","SU13","SU21",
                       "SU29","SU08","SU16","SU24","SU32")
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20",
                         "ST28")
deep_columns <- select(data, one_of(deep_questions))
data$deep <- rowMeans(deep_columns)
surface_columns <- select(data, one_of(surface_questions))
data$surf <- rowMeans(surface_columns)
strategic_columns <- select(data, one_of(strategic_questions))
data$stra <- rowMeans(strategic_columns)

# Select only the desired fields for the analysis data set
anal <- select(data, c("gender", "Age", "attitude", "deep", "stra", "surf",
                       "Points"))
colnames(anal)[2] <- "age"
colnames(anal)[7] <- "points"
# Exclude observations where the exam points variable is zero from the
# analysis data set
anal <- filter(anal, points != 0)

# Show the structure of the analysis dataset
print("=== analysis dataset")
## [1] "=== analysis dataset"
str(anal)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
# Write the analysis data set into a file
write.csv(anal, "data/analysis-dataset.csv")

# Demonstrate that we can read back the analysis data set
readback <- read.csv("data/analysis-dataset.csv")
print("=== structure of the readback dataset")
## [1] "=== structure of the readback dataset"
str(readback)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
print("=== head of the readback dataset")
## [1] "=== head of the readback dataset"
head(readback)
##   X gender age attitude     deep  stra     surf points
## 1 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6 6      F  38      3.8 4.750000 3.625 2.416667     21

Analysis exercise

Objective of the study

The objective of the study was to analyze how various factors influence the student’s skills in statistics.

Methods

In this exercise we analyzed the dataset described in Kimmo Vehkalahti: The Relationship between Learning Approaches and Students’ Achievements in an Introductory Statistics Course in Finland, 60th World Statistics Congress (ISI2015), Rio de Janeiro, Brazil, July 2015. The overall approach is covered more deeply in Tait, Entwistle and McKune: ASSIST: The Approaches and Study Skills Inventory for Students, 1998.

The data set was collected to understand how various factors influence students’ statistics skills. It includes data for 183 students, with a total of 60 variables collected per student. The variables can be grouped into those relating to the student’s background and scholastic achievements, deep questions (seeking meaning, relating ideas, use of evidence), surface questions (lack of purpose, unrelated memorizing, syllabus-boundedness), and strategic questions (organized studying, time management).

Student’s exam points are taken as measuring the level of the student’s skill that we try to explain using the other variables. For this analysis, we chose linear least squares regression with the student’s attitude and the averaged answers to surface questions and strategic questions as explanatory variables. We chose not to include deep questions, age, or gender as explanatory variables because the assignment instructions said we should use three variables; extending the analysis to include additional variables would be straightforward.

We additionally analyze the significance of the results using the t statistic. This includes examining the distribution of the regression errors and checking whether it is approximately normal, to verify that the chosen significance testing methods are applicable in this situation.

The analysis was performed using R (version 3.6.3). The R scripts and their output are included below.

Data preprocessing

The original data set and its detailed description are available online (the data URL is shown in the wrangling script above).

We preprocessed the data by averaging the answers to the deep, surface, and strategic questions with rowMeans (12, 12, and 8 questions, respectively). For general attitude, a sum had already been precomputed in the Attitude column, so it was simply divided by 10 to bring it back to the original scale. Additionally, the student’s exam points (the response variable), age, and gender were carried over into the analysis data set.

The data preprocessing was done using the script built for the data wrangling exercise, data/create_learning2014.R, shown above.

Observations where the exam points are zero were excluded from the analysis. This left 166 observations.

To get a quick overview of the data and the relationships between points and the possible explanatory variables, we first plot points against each variable.

library(ggplot2)
data <- read.csv("data/analysis-dataset.csv")
p <- ggplot(data, aes(x=deep, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'

p <- ggplot(data, aes(x=surf, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'

p <- ggplot(data, aes(x=stra, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'

p <- ggplot(data, aes(x=attitude, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'

p <- ggplot(data, aes(x=age, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'

p <- ggplot(data, aes(x=gender, y=points)) + geom_point() + geom_smooth(method=lm)
p
## `geom_smooth()` using formula 'y ~ x'

It is clear that gender as a two-category variable is not suitable for linear regression. The data also contains relatively few observations with age above about 27, so age is probably not a very helpful explanatory variable either. The deep-question averages are mostly concentrated between 3 and 5, so they might be less useful for regression than the other variables; deep also seems to have little influence on points based on the regression line. These observations contributed to choosing attitude, surf, and stra for our analysis.

Analysis

First, we compute a linear least squares regression, trying to explain points using the students’ attitude and the averaged answers for the surface and strategic questions.

library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
model <- lm(points ~ attitude + surf + stra, data=data)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude + surf + stra, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## surf         -0.5861     0.8014  -0.731  0.46563    
## stra          0.8531     0.5416   1.575  0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

For statistically significant results we would expect p < 0.05. The estimated fit shows a positive coefficient for attitude (3.40, p = 1.9 * 10^-8, which is statistically significant). It also shows a negative coefficient for surface questions and a positive coefficient for strategic questions, but these are not statistically significant (p = 0.47 and p = 0.12, respectively).
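As a cross-check on these t tests, approximate 95% confidence intervals can be computed by hand from the printed estimates and standard errors (a sketch using only the numbers in the summary above; with the model object available, confint(model) would give the exact intervals):

```r
# Coefficient estimates and standard errors copied from summary() above
est <- c(attitude = 3.3952, surf = -0.5861, stra = 0.8531)
se  <- c(attitude = 0.5741, surf = 0.8014, stra = 0.5416)
df  <- 162  # residual degrees of freedom reported by summary()

# 95% CI: estimate +/- t quantile * standard error
tq <- qt(0.975, df)
ci <- cbind(lower = est - tq * se, upper = est + tq * se)
round(ci, 3)
```

The interval for attitude stays well above zero, while the intervals for surf and stra straddle zero, matching the p values above.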

The t statistic used for significance testing in this linear regression assumes that the distribution of errors follows the normal distribution. To validate this assumption, we first plot the residuals.

library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
model <- lm(points ~ attitude + surf + stra, data=data)
plot(model)

The residuals vs. fitted plot shows no obvious pattern that would invalidate the p values. For a normal distribution we would expect the highest concentration of residuals around zero, and the residuals should be independent of the fitted value. There is a small cluster of negative residuals near fitted values of 24-27, and at extreme fitted values the residuals tend to be negative. However, this effect seems small enough not to cause significant concern.

From the Q-Q plot we would expect the residual distribution to match the theoretical distribution. We can see that while the regression seems to work fairly well for most data points, extreme errors on both sides tend to be more negative than predicted by a normal distribution. However, the deviation is not very big. Overall, the plot looks to me sufficiently consistent with a bounded normal distribution to consider the p values reasonably reliable, but some caution would be warranted.

Finally, the residuals vs. leverage plot for a normal distribution should show the same spread regardless of leverage and the standardized residual density should follow a normal distribution. This appears to hold reasonably well in this case, though it is also possible that the spread decreases with leverage. There are too few data points at high leverage to be sure. Generally I would say the plot is consistent with the assumption of an approximately normal distribution of residuals.

We will now remove the variables that are not statistically significant from the regression. Thus, we compute the regression against just attitude.

library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
model <- lm(points ~ attitude, data=data)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
plot(model)

There is little change in the coefficient of attitude compared to the previous regression (now 3.53, was 3.40). Its statistical significance has slightly improved (now p = 4.12 * 10^-9, which is highly significant). The intercept 11.64 is also highly significant (p = 1.95 * 10^-9).

Looking at the residuals vs fitted plot, it now looks like a better match against the normal distribution, though there seem to be fewer than expected samples further out. The Q-Q plot has not changed much, or perhaps the negative deviation at extremes has gotten stronger.

Given the very high significance from the t test and the approximate conformance with the normal distribution, it seems likely that the result for attitude truly is significant. However, I would not trust the absolute p value as there is some deviation from a normal distribution. I would seek to confirm the p value using nonparametric tests if submitting for a peer-reviewed scientific publication.

Finally, we try to analyze the correlation and explanatory power of attitude on points. For this, we use the Pearson correlation coefficient and use R^2 to measure explanatory power. This assumes a normal distribution.

library(dplyr)
data <- read.csv("data/analysis-dataset.csv")
r = cor(data$attitude, data$points, method="pearson")
print(r)
## [1] 0.4365245
cat("correlation R", r, "\n")
## correlation R 0.4365245
cat("R^2", r * r, "\n")
## R^2 0.1905537

The R^2 value of 0.19 suggests that attitude explains about 19% of the variance in points.
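For a single-predictor linear model, the squared Pearson correlation is exactly the Multiple R-squared reported by summary(), which gives a quick consistency check on the numbers above:

```r
# Pearson correlation between attitude and points, from the output above
r <- 0.4365245
# summary(lm(points ~ attitude)) reported Multiple R-squared: 0.1906;
# for simple linear regression this equals the squared correlation
cat("r^2 =", round(r^2, 4), "\n")
```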

Conclusion

The student’s attitude toward statistics seems to have a statistically significant relationship with the student’s points on the exam that can be approximated with a linear model: \[ points = 3.53 \cdot attitude + 11.64 \]

This model explains about 19% of the variance in points. About 81% of variance is not explained by this model.

It should be noted that we have not analyzed whether higher points are caused by attitude or vice versa, or whether the results could be better explained by some other common cause. We have merely detected a correlation.


Chapter 3: Logistic regression

Data wrangling

See the script create_alc.R in the data directory on github.

Analysis

In this exercise we analyzed how students’ alcohol usage can be predicted using other variables in the data set. The data set is the Student Performance Data Set from the Machine Learning Repository at UC Irvine. The dataset is described in P. Cortez and A. Silva: Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira (eds): Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008), pp. 5-12, Porto, Portugal, April 2008.

(2) Loading data and describing it

First, let’s look at an overview of the data. Looking at available field names, we select a few fields as potential candidates for further investigation.

library(dplyr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

alc <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt")

# Print column names
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
# Print overall structure of the data
str(alc)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

The data includes numerous factors describing the student’s background and lifestyle, as well as problems with the student’s studies, such as failed classes or absences, and the student’s alcohol usage. Our task is to study the relationship between high/low alcohol consumption and the other data. In total, the data set includes 382 observations of 35 variables.

(3) Selecting four variables and hypotheses

Looking at the data, I decided to choose the following variables for closer inspection:

  • Medu (Mother’s education) - it is conceivable that parents’ education could have an influence (generally, higher education of parents correlates with school success, I think)
  • Fedu (Father’s education) - same reason
  • goout (How much the student goes out with friends) - people often drink when going out
  • famrel (Quality of family relationships) - conceivably this could have an influence, for example kids from problem families having more problems of their own (i.e., a negative correlation in this case).

(4) Graphical and numerical analysis of the variables

Let’s now look at them graphically using scatterplots (with some position jitter to make the point density more visible, given the discrete values):

ggplot(alc, aes(x=Medu, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

ggplot(alc, aes(x=Fedu, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

ggplot(alc, aes(x=goout, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

ggplot(alc, aes(x=famrel, y=alc_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5)) + geom_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'

Let’s also look at the data numerically using covariances.

# Covariances help us understand the relationship between different variables.
sub <- select(alc, one_of(c("alc_use", "Medu", "Fedu", "goout", "famrel")))
cov(sub, sub)
##              alc_use       Medu        Fedu      goout      famrel
## alc_use  0.975635899 0.01153619 0.009385607 0.43383353 -0.10978961
## Medu     0.011536189 1.18022289 0.773865963 0.08223056  0.01192783
## Fedu     0.009385607 0.77386596 1.201742452 0.04117025  0.01313710
## goout    0.433833533 0.08223056 0.041170246 1.28125902  0.08291077
## famrel  -0.109789614 0.01192783 0.013137101 0.08291077  0.84938368

We can see that alc_use has a covariance of 0.43 with goout and -0.11 with famrel, while the covariances with Medu and Fedu are small.

Visually and based on the linear regression line, it does not look like parents’ education has much impact on alcohol use. Nevertheless, they might have an effect when conditioning on other variables or after eliminating the influence of other variables, so let’s keep them in the analysis. Going out and family relationships, however, show a clear association.
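Covariances depend on the scales of the variables, so correlations are easier to compare across variable pairs. With the data at hand one would simply call cor(sub); the same numbers can also be recovered from the covariance matrix printed above using cov2cor (a sketch using the rounded values from that output):

```r
# Covariance matrix copied (rounded) from the cov(sub, sub) output above
vars <- c("alc_use", "Medu", "Fedu", "goout", "famrel")
S <- matrix(c( 0.97564, 0.01154, 0.00939, 0.43383, -0.10979,
               0.01154, 1.18022, 0.77387, 0.08223,  0.01193,
               0.00939, 0.77387, 1.20174, 0.04117,  0.01314,
               0.43383, 0.08223, 0.04117, 1.28126,  0.08291,
              -0.10979, 0.01193, 0.01314, 0.08291,  0.84938),
            nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# cov2cor rescales a covariance matrix into a correlation matrix
round(cov2cor(S), 2)
```

On this scale, alc_use correlates about 0.39 with goout and -0.12 with famrel, while the parental-education correlations with alc_use are near zero.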

(5) Logistic regression

Let’s now focus on high alcohol use, i.e., the binarized variable high_use, and use logistic regression to analyze the impact of the selected variables on it. (The boot library is loaded here because we will use its cv.glm function for cross-validation later.)

library(boot)

m <- glm(high_use ~ Medu + Fedu + goout + famrel, data=alc, family="binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ Medu + Fedu + goout + famrel, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6349  -0.7725  -0.5598   0.9912   2.4030  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.73933    0.68188  -2.551  0.01075 *  
## Medu        -0.10465    0.14883  -0.703  0.48194    
## Fedu         0.05742    0.14751   0.389  0.69707    
## goout        0.80690    0.11928   6.765 1.33e-11 ***
## famrel      -0.42459    0.13316  -3.189  0.00143 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 401.82  on 377  degrees of freedom
## AIC: 411.82
## 
## Number of Fisher Scoring iterations: 4

Looking at the significance level of the coefficients, we can see that Medu and Fedu are not statistically significant, while goout and famrel are statistically significant at the p=0.01 level. Let’s run the regression again but without Medu and Fedu.

library(boot)

m <- glm(high_use ~ goout + famrel, data=alc, family="binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ goout + famrel, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5586  -0.7385  -0.5676   0.9969   2.3758  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.8742     0.6015  -3.116  0.00183 ** 
## goout         0.8004     0.1187   6.743 1.55e-11 ***
## famrel       -0.4218     0.1330  -3.171  0.00152 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 402.32  on 379  degrees of freedom
## AIC: 408.32
## 
## Number of Fisher Scoring iterations: 4

We can see that goout and famrel are still statistically significant at the p=0.01 level, with coefficients of 0.80 and -0.42, respectively. Their standard errors are at most about one third of the absolute values of the estimates.

Let’s now look at the coefficients and their confidence intervals after exponentiation. Exponentiation effectively converts addition into multiplication, and exponentiated coefficients >1 indicate a positive impact on high_use while coefficients <1 indicate a negative impact.

OR <- coef(m) %>% exp
OR
## (Intercept)       goout      famrel 
##   0.1534844   2.2264964   0.6558574
CI <- exp(confint(m))
## Waiting for profiling to be done...
CI
##                  2.5 %    97.5 %
## (Intercept) 0.04577266 0.4877710
## goout       1.77580201 2.8309109
## famrel      0.50355350 0.8498297
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1534844 0.04577266 0.4877710
## goout       2.2264964 1.77580201 2.8309109
## famrel      0.6558574 0.50355350 0.8498297

We can see that the results are in line with our hypothesis and with the non-exponentiated coefficients (i.e., a negative value there corresponds to a value <1 after exponentiation).

The exponentiated coefficients relate directly to the odds ratios. Essentially, in this case, an increase in goout by one multiplies the odds of high alcohol use by a factor of 2.2, and an increase in famrel by one multiplies them by a factor of 0.66 (i.e., decreases them). The confidence intervals of these odds ratios are given directly by the confidence intervals for the exponentiated coefficients, as reported by exp(confint(m)) above.
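To make the odds ratios concrete, the fitted coefficients can be pushed through the inverse logit to get predicted probabilities for specific students (a worked sketch using the estimates from the two-variable model above; the chosen goout and famrel values are just illustrative):

```r
# Coefficients from the goout + famrel model above
b0 <- -1.8742; b_goout <- 0.8004; b_famrel <- -0.4218

# plogis() is the inverse logit: log-odds -> probability.
# A student who goes out a lot (goout = 5) with poor family
# relations (famrel = 2):
p_risky <- plogis(b0 + b_goout * 5 + b_famrel * 2)
# A student who goes out little (goout = 1) with good family
# relations (famrel = 5):
p_safe <- plogis(b0 + b_goout * 1 + b_famrel * 5)

cat("P(high_use):", round(p_risky, 2), "vs", round(p_safe, 2), "\n")
```

The model separates these two illustrative profiles quite sharply (roughly 0.78 vs 0.04).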

(6) Predictive power and cross tabulation

Let’s now explore the predictive power of the model.

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Predict the probability of high_use using the model
probabilities <- predict(m, type="response")
alc <- mutate(alc, probability=probabilities)
alc <- mutate(alc, prediction=probability >= 0.5)

# Let's look at ten samples and how they were predicted vs. actual high_use
select(alc, high_use, prediction, Medu, Fedu, goout, famrel) %>% tail(10)
##     high_use prediction Medu Fedu goout famrel
## 373    FALSE      FALSE    1    1     2      4
## 374     TRUE      FALSE    4    2     3      5
## 375    FALSE      FALSE    2    2     3      5
## 376    FALSE      FALSE    4    4     3      4
## 377    FALSE      FALSE    2    3     2      5
## 378    FALSE      FALSE    3    1     4      4
## 379    FALSE      FALSE    1    1     1      1
## 380    FALSE      FALSE    1    1     1      1
## 381     TRUE       TRUE    3    1     5      2
## 382     TRUE      FALSE    3    2     1      4
# Let's also create the confusion matrix for the prediction
table(high_use=alc$high_use, prediction=alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   243   27
##    TRUE     64   48

Based on the ten samples, we see 8 correct predictions, one Type I error, and one Type II error, which looks reasonable. The confusion matrix shows that the prediction missed 64 cases of high alcohol use (false negatives), correctly identified 48, and produced 27 false positives.
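The overall training error implied by the confusion matrix can be computed directly from its cells (a quick check using the counts above):

```r
# Cells of the confusion matrix printed above
tn <- 243  # correctly predicted not-high use
fp <- 27   # predicted high use, actually not
fn <- 64   # missed cases of high use
tp <- 48   # correctly predicted high use
n <- tn + fp + fn + tp

error    <- (fp + fn) / n  # proportion of wrong predictions
accuracy <- (tn + tp) / n
cat("n =", n, " training error =", round(error, 3),
    " accuracy =", round(accuracy, 3), "\n")
```

This training error of about 0.24 is a slightly optimistic estimate compared to the cross-validated error computed below, as expected.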

Let’s also look at a scatterplot of the predictions vs. actual values (though this is not very useful as we only have four possible combinations of values - but a plot with position jitter will show something).

ggplot(alc, aes(x=prediction, y=high_use)) + geom_point(shape=1, position=position_jitter(width=0.5, height=0.5))

The plot supports our earlier analysis of the confusion matrix.

(7) Cross-validation

Let’s now use 10-fold cross-validation to estimate how sensitive the model is to the sampling of the data. (The loss function was already defined above.)

cv <- cv.glm(data=alc, cost=loss_func, glmfit = m, K = 10)

# average proportion of wrong predictions in the cross-validation, i.e., the error
cv$delta[1]
## [1] 0.2539267

The cross validation indicates an error of 0.25 for my model. This is better than the error of the model introduced in DataCamp (0.26). (Note, however, that due to the random nature of cross-validation, and using only 10-fold cross validation as mandated by the exercise, the accuracy of these errors is not very high and the errors will randomly vary somewhat from run to run. I’ve also seen runs where it is higher than 0.26.)
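The claim that the difference between 0.25 and 0.26 is within random variation can be backed up with a rough binomial standard error for a misclassification-rate estimate (a back-of-the-envelope sketch, treating each of the 382 predictions as an independent Bernoulli trial, which ignores the dependence between cross-validation folds):

```r
p <- 0.25  # cross-validated error rate from above
n <- 382   # number of observations

# Approximate standard error of an estimated error rate
se <- sqrt(p * (1 - p) / n)
cat("approximate SE of the error estimate:", round(se, 3), "\n")
```

The standard error of roughly 0.022 is larger than the 0.01 difference between the two models, so the comparison is indeed inconclusive.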

(8) Comparing different models using cross-validation

Let’s now see how the model error behaves as the number of variables decreases. For this analysis we’ll first use all variables (except high_use, alc_use, Dalc, Walc, probability, and prediction) as explanatory variables, and then start removing the variables with the lowest significance. Finally, we collect the cross-validation errors and plot the change in error as a function of the number of variables removed. We’ll use 29-fold cross-validation to reduce the estimation error while keeping running times reasonable (R warns about K=30; using 29 instead keeps warning output out of this report).

# These columns are always excluded
always_exclude = c("high_use", "alc_use", "Dalc", "Walc",
  "prediction", "probability")

# Dataframe where we collect the errors from each step in simplification
# (this is updated by the step() function)
results = data.frame(count=integer(), error=double(), last=character())

# Estimates a model that excludes the variables given as argument (plus the
# variables that are always excluded), performs cross-validation to estimate
# the model error, and collects the errors into ``results``.
step <- function(additional_excludes) {
  excludes <- c(always_exclude, additional_excludes)
  cols <- colnames(alc)[!colnames(alc) %in% excludes]
  formula_text <- paste("high_use ~ ", paste(cols, collapse=" + "))
  formula = as.formula(formula_text)
  # Note: glm() defaults to the gaussian family here, so this actually fits a
  # linear probability model rather than logistic regression.
  m <- glm(formula, data=alc)
  probabilities <- predict(m, type="response")
  alc <- mutate(alc, probability=probabilities)
  alc <- mutate(alc, prediction=probability >= 0.5)
  cv <- cv.glm(data=alc, cost=loss_func, glmfit=m, K=29)
  error <- cv$delta[1]  # error
  count <- length(additional_excludes)
  if (length(additional_excludes) == 0) {
    last = "<none>"
  } else {
    last <- additional_excludes[[length(additional_excludes)]]
  }
  results <<- rbind(results, data.frame(count, error, last))
  return(m)
}

# Perform one step for the exercise.  We can add steps by adding more fields
# to be excluded (as many as we like).
m <- step(c())  # famsup has lowest significance
m <- step(c("famsup"))
m <- step(c("famsup", "schoolsup"))
m <- step(c("famsup", "schoolsup", "traveltime"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime", "reason"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
  "Pstatus"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
  "Pstatus", "address"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
  "Pstatus", "address", "age"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
  "Pstatus", "address", "age", "guardian"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
  "Pstatus", "address", "age", "guardian", "romantic"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
  "Pstatus", "address", "age", "guardian", "romantic", "higher"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
  "Pstatus", "address", "age", "guardian", "romantic", "higher", "paid"))
m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
  "Pstatus", "address", "age", "guardian", "romantic", "higher", "paid",
  "nursery"))
summary(m)
## 
## Call:
## glm(formula = formula, data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7470  -0.2983  -0.1224   0.2964   1.0904  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.023323   0.105566   0.221  0.82526    
## sexM         0.181264   0.041632   4.354 1.72e-05 ***
## famrel      -0.077049   0.022620  -3.406  0.00073 ***
## goout        0.135759   0.018450   7.358 1.17e-12 ***
## absences     0.011949   0.002735   4.369 1.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1632952)
## 
##     Null deviance: 79.162  on 381  degrees of freedom
## Residual deviance: 61.562  on 377  degrees of freedom
## AIC: 398.78
## 
## Number of Fisher Scoring iterations: 2
results
##    count     error       last
## 1      0 0.2382199     <none>
## 2      1 0.2460733     famsup
## 3      2 0.2460733  schoolsup
## 4      3 0.2382199 traveltime
## 5      4 0.2460733       Fjob
## 6      5 0.2382199       Fedu
## 7      6 0.2460733       Fjob
## 8      7 0.2329843    famsize
## 9      8 0.2382199         G2
## 10     9 0.2329843   failures
## 11    10 0.2356021   internet
## 12    11 0.2329843         G1
## 13    12 0.2356021         G3
## 14    13 0.2198953  studytime
## 15    14 0.2277487 activities
## 16    15 0.2172775     school
## 17    16 0.2172775     health
## 18    17 0.2146597       Mjob
## 19    18 0.2120419       Medu
## 20    19 0.2068063   freetime
## 21    20 0.2146597     reason
## 22    21 0.2172775    Pstatus
## 23    22 0.2172775    address
## 24    23 0.2146597        age
## 25    24 0.2041885   guardian
## 26    25 0.2172775   romantic
## 27    26 0.2120419     higher
## 28    27 0.2172775       paid
## 29    28 0.2146597    nursery

This leaves us with four variables (goout, absences, sex, and famrel) that are all statistically highly significant (at the p=0.001 level). The prediction error is around 0.21. (Note that Fjob accidentally appears twice in the exclusion lists above, so step 6 did not actually remove a new variable.) For the fun of it, let’s see what happens if we remove more.

m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
  "Pstatus", "address", "age", "guardian", "romantic", "higher", "paid",
  "nursery", "famrel"))
summary(m)
## 
## Call:
## glm(formula = formula, data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8046  -0.3032  -0.1538   0.3365   1.0327  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.263330   0.064620  -4.075 5.61e-05 ***
## sexM         0.172743   0.042136   4.100 5.07e-05 ***
## goout        0.130710   0.018646   7.010 1.10e-11 ***
## absences     0.012497   0.002768   4.514 8.50e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1678754)
## 
##     Null deviance: 79.162  on 381  degrees of freedom
## Residual deviance: 63.457  on 378  degrees of freedom
## AIC: 408.36
## 
## Number of Fisher Scoring iterations: 2
results
##    count     error       last
## 1      0 0.2382199     <none>
## 2      1 0.2460733     famsup
## 3      2 0.2460733  schoolsup
## 4      3 0.2382199 traveltime
## 5      4 0.2460733       Fjob
## 6      5 0.2382199       Fedu
## 7      6 0.2460733       Fjob
## 8      7 0.2329843    famsize
## 9      8 0.2382199         G2
## 10     9 0.2329843   failures
## 11    10 0.2356021   internet
## 12    11 0.2329843         G1
## 13    12 0.2356021         G3
## 14    13 0.2198953  studytime
## 15    14 0.2277487 activities
## 16    15 0.2172775     school
## 17    16 0.2172775     health
## 18    17 0.2146597       Mjob
## 19    18 0.2120419       Medu
## 20    19 0.2068063   freetime
## 21    20 0.2146597     reason
## 22    21 0.2172775    Pstatus
## 23    22 0.2172775    address
## 24    23 0.2146597        age
## 25    24 0.2041885   guardian
## 26    25 0.2172775   romantic
## 27    26 0.2120419     higher
## 28    27 0.2172775       paid
## 29    28 0.2146597    nursery
## 30    29 0.2120419     famrel

Turns out the prediction error did not really change when famrel was removed (even though it was statistically highly significant). Let’s see what happens if we also remove sex, the next least significant variable.

m <- step(c("famsup", "schoolsup", "traveltime", "Fjob", "Fedu",
  "Fjob", "famsize", "G2", "failures", "internet", "G1", "G3", "studytime",
  "activities", "school", "health", "Mjob", "Medu", "freetime", "reason",
  "Pstatus", "address", "age", "guardian", "romantic", "higher", "paid",
  "nursery", "famrel", "sex"))
summary(m)
## 
## Call:
## glm(formula = formula, data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8224  -0.2859  -0.1266   0.3944   1.0561  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.191905   0.063512  -3.022  0.00269 ** 
## goout        0.135835   0.018988   7.154 4.39e-12 ***
## absences     0.011713   0.002819   4.155 4.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1748772)
## 
##     Null deviance: 79.162  on 381  degrees of freedom
## Residual deviance: 66.278  on 379  degrees of freedom
## AIC: 422.97
## 
## Number of Fisher Scoring iterations: 2
results
##    count     error       last
## 1      0 0.2382199     <none>
## 2      1 0.2460733     famsup
## 3      2 0.2460733  schoolsup
## 4      3 0.2382199 traveltime
## 5      4 0.2460733       Fjob
## 6      5 0.2382199       Fedu
## 7      6 0.2460733       Fjob
## 8      7 0.2329843    famsize
## 9      8 0.2382199         G2
## 10     9 0.2329843   failures
## 11    10 0.2356021   internet
## 12    11 0.2329843         G1
## 13    12 0.2356021         G3
## 14    13 0.2198953  studytime
## 15    14 0.2277487 activities
## 16    15 0.2172775     school
## 17    16 0.2172775     health
## 18    17 0.2146597       Mjob
## 19    18 0.2120419       Medu
## 20    19 0.2068063   freetime
## 21    20 0.2146597     reason
## 22    21 0.2172775    Pstatus
## 23    22 0.2172775    address
## 24    23 0.2146597        age
## 25    24 0.2041885   guardian
## 26    25 0.2172775   romantic
## 27    26 0.2120419     higher
## 28    27 0.2172775       paid
## 29    28 0.2146597    nursery
## 30    29 0.2120419     famrel
## 31    30 0.2513089        sex

This time we see a significant increase in error, to around 0.25. We thus conclude that goout, absences, and sex form the best combination that we can reach using this method. It is, however, possible that we are at a local optimum and that different choices earlier would have produced a better final result, though I think that is unlikely.
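The manual backward elimination above can also be automated; for example, MASS::stepAIC performs backward elimination by AIC. A minimal sketch on simulated data (the variables x1..x3 are made up for illustration):

```r
library(MASS)

set.seed(42)
# Toy data: y depends on x1 and x2; x3 is pure noise
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300))
d$y <- rbinom(300, 1, plogis(d$x1 + d$x2))

full <- glm(y ~ x1 + x2 + x3, data = d, family = "binomial")

# Backward elimination: drop terms as long as AIC improves
reduced <- stepAIC(full, direction = "backward", trace = FALSE)
formula(reduced)
```

Note that stepAIC optimizes AIC on the training data, whereas the exercise above uses cross-validated prediction error, so the two can disagree about which variables to keep.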

Let’s finish with a plot of the error as a function of the number of variables removed.

data = select(results, count, error)
plot(data$count, data$error,
  xlab="variables removed", ylab="error")
title("Prediction error vs. number of variables removed")

Chapter 4: Clustering and classification

Analysis

Let’s start by loading the libraries we’ll use.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)
library(GGally)
library(corrplot)
## corrplot 0.84 loaded
library(dplyr)

(2) Loading the Boston dataset from the MASS library

In this exercise we are analyzing the “Boston” dataset from the R MASS library. This example dataset is further described in D. Harrison and D.L. Rubinfeld: Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5:81-102, 1978. The data set relates to the housing values in the suburbs of Boston and several variables that might predict housing prices.

data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

There are a total of 506 observations of 14 variables. All variables have numeric values.

(3) Graphical overview of the data

ggpairs(Boston, axisLabels="show")

We can see that the distributions of the variables vary quite a bit. Some have a neat near-normal distribution, while others are highly skewed and far from normally distributed.

Some variables are neatly linearly correlated, while others show non-linear correlation (e.g., having a “bump” in the middle of the 2-D plot).

We can see from the pairwise plots of the variables that there are significant correlations between many of the variables. In particular, if we look at the median house value (medv, bottom row / rightmost column), we can see that it might have correlations with all or most of the other variables, though in some cases the correlation might not be linear (e.g., age, dis, zn).

Let’s also plot the correlations between the variables to understand the strength of the correlations between each pair of variables:

cor_matrix <- cor(Boston)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)

We can see that medv has the highest positive or negative correlations with, e.g., rm (average number of rooms per dwelling) and lstat (percentage of population with lower status), but many other correlations are also significant.
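The same ranking can be pulled out numerically from the correlation matrix; for example, the correlations of medv with the other variables, ordered by absolute strength:

```r
library(MASS)

data("Boston")
cor_matrix <- cor(Boston)

# Correlations of medv with every other variable, strongest first
medv_cor <- cor_matrix["medv", colnames(Boston) != "medv"]
round(medv_cor[order(abs(medv_cor), decreasing = TRUE)], 2)
# lstat (about -0.74) and rm (about 0.70) top the list
```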

(4) Standardizing the dataset, creating a categorical variable for crime rate, and dividing into train/test sets

Let’s then standardize the data set to zero mean and unit standard deviation.
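Concretely, scale() subtracts each column’s mean and divides by its standard deviation; a quick equivalence check:

```r
library(MASS)

data("Boston")

# scale() standardizes each column: (x - mean(x)) / sd(x)
scaled1 <- scale(Boston)
scaled2 <- apply(Boston, 2, function(x) (x - mean(x)) / sd(x))

all.equal(as.numeric(scaled1), as.numeric(scaled2))
# TRUE
```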

# Standardize the test set
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# Create a new categorical variable with values low, med_low, med_high, high
# from the crime rate quantiles, replacing variable ``crim`` by ``crime``
bins = quantile(boston_scaled$crim)
labels = c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim,
             breaks=bins,
             labels=labels,
             include.lowest=TRUE)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
str(boston_scaled)
## 'data.frame':    506 obs. of  14 variables:
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
# Divide the dataset into train and test sets, so that 80% of the data goes
# to the train set.
cnt <- nrow(boston_scaled)
ind <- sample(cnt, size=cnt * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
str(train)
## 'data.frame':    404 obs. of  14 variables:
##  $ zn     : num  -0.487 0.713 -0.487 -0.487 -0.487 ...
##  $ indus  : num  1.231 0.569 1.015 -1.202 -0.437 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  0.434 -0.783 1.194 -0.947 -0.144 ...
##  $ rm     : num  -0.2614 -0.0507 -0.7652 2.5396 0.5542 ...
##  $ age    : num  0.868 0.31 1.077 0.264 0.665 ...
##  $ dis    : num  -0.7179 -0.0855 -1.0266 -0.1424 0.2108 ...
##  $ rad    : num  -0.522 -0.637 1.66 -0.867 -0.637 ...
##  $ tax    : num  -0.0311 -0.8202 1.5294 -0.7846 -0.6007 ...
##  $ ptratio: num  -1.735 -0.118 0.806 -0.21 1.175 ...
##  $ black  : num  -1.276 0.441 0.399 0.441 0.258 ...
##  $ lstat  : num  -0.3981 -0.2889 1.0176 -1.1823 -0.0943 ...
##  $ medv   : num  0.268 -0.21 -1.526 1.758 -0.167 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 3 1 4 2 3 4 1 4 3 3 ...
str(test)
## 'data.frame':    102 obs. of  14 variables:
##  $ zn     : num  0.2845 -0.4872 0.0487 0.0487 -0.4872 ...
##  $ indus  : num  -1.287 -0.593 -0.476 -0.476 -0.437 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.265 -0.265 -0.144 ...
##  $ rm     : num  0.413 0.194 -0.399 -0.563 -0.478 ...
##  $ age    : num  -0.12 0.367 0.615 -1.051 -0.241 ...
##  $ dis    : num  0.14 0.557 1.328 0.786 0.433 ...
##  $ rad    : num  -0.982 -0.867 -0.522 -0.522 -0.637 ...
##  $ tax    : num  -0.666 -0.986 -0.577 -0.577 -0.601 ...
##  $ ptratio: num  -1.458 -0.303 -1.504 -1.504 1.175 ...
##  $ black  : num  0.441 0.441 0.329 0.371 0.441 ...
##  $ lstat  : num  -1.074 -0.492 0.623 0.428 -0.615 ...
##  $ medv   : num  0.1595 -0.1014 -0.395 -0.0906 -0.2319 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 1 2 2 3 3 3 3 3 3 ...

(5) Fit linear discriminant analysis on the train set

# Linear discriminant analysis for ``crime`` using all other variables as
# predictor variables
model = lda(crime ~ ., data=train)
model
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2475248 0.2574257 0.2500000 0.2450495 
## 
## Group means:
##                  zn      indus         chas        nox         rm        age
## low       1.0014540 -0.8852866 -0.114845058 -0.8824483  0.4420553 -0.9014894
## med_low  -0.1351529 -0.2871502 -0.007331936 -0.5593918 -0.1376444 -0.2977610
## med_high -0.3751653  0.1958381  0.234426408  0.4148963  0.0917352  0.4113685
## high     -0.4872402  1.0149946 -0.073485621  1.0342837 -0.4118197  0.7941491
##                 dis        rad        tax     ptratio       black       lstat
## low       0.9366433 -0.7004968 -0.7174512 -0.44921851  0.37452017 -0.76598741
## med_low   0.3143360 -0.5534047 -0.4728851 -0.04963482  0.31820716 -0.11914919
## med_high -0.3676665 -0.3871700 -0.2887957 -0.31285581  0.04684449  0.04607382
## high     -0.8505090  1.6596029  1.5294129  0.80577843 -0.84650110  0.91889508
##                  medv
## low       0.518771304
## med_low   0.006260481
## med_high  0.152530321
## high     -0.726455970
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.13201446  0.721646602 -0.88701385
## indus    0.01624405 -0.198409149  0.10939050
## chas    -0.07094906 -0.019113941  0.08272880
## nox      0.22075035 -0.793025647 -1.46405727
## rm      -0.07094749 -0.095678126 -0.18468592
## age      0.32737892 -0.303857932 -0.06982634
## dis     -0.09801660 -0.258112426 -0.06626246
## rad      3.25270491  0.886620890 -0.12942189
## tax     -0.02104907 -0.007491377  0.62487297
## ptratio  0.13377564  0.049845381 -0.24509305
## black   -0.15553007 -0.008168146  0.09516348
## lstat    0.17920825 -0.243123788  0.34476706
## medv     0.12841204 -0.410734535 -0.19989112
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9480 0.0385 0.0135
classes = as.numeric(train$crime)
plot(model, dimen=2, col=classes, pch=classes)

From the plot we can see that high is relatively easy to separate (apart from a few med_high instances), whereas the other classes overlap more when mapped to two-dimensional space. They could, however, still be separable in the higher-dimensional original space.

(6) Test LDA prediction

# Save the crime categories from the test set and remove the categorical
# crime variable from the test set
correct_classes = test$crime
test <- dplyr::select(test, -crime)
str(correct_classes)
##  Factor w/ 4 levels "low","med_low",..: 1 1 2 2 3 3 3 3 3 3 ...
str(test)
## 'data.frame':    102 obs. of  13 variables:
##  $ zn     : num  0.2845 -0.4872 0.0487 0.0487 -0.4872 ...
##  $ indus  : num  -1.287 -0.593 -0.476 -0.476 -0.437 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.265 -0.265 -0.144 ...
##  $ rm     : num  0.413 0.194 -0.399 -0.563 -0.478 ...
##  $ age    : num  -0.12 0.367 0.615 -1.051 -0.241 ...
##  $ dis    : num  0.14 0.557 1.328 0.786 0.433 ...
##  $ rad    : num  -0.982 -0.867 -0.522 -0.522 -0.637 ...
##  $ tax    : num  -0.666 -0.986 -0.577 -0.577 -0.601 ...
##  $ ptratio: num  -1.458 -0.303 -1.504 -1.504 1.175 ...
##  $ black  : num  0.441 0.441 0.329 0.371 0.441 ...
##  $ lstat  : num  -1.074 -0.492 0.623 0.428 -0.615 ...
##  $ medv   : num  0.1595 -0.1014 -0.395 -0.0906 -0.2319 ...
# Predict the classes with the LDA model on the test data
predictions = predict(model, newdata=test)

# Cross-tabulate the correct classes vs. predicted classes.
table(correct=correct_classes, predicted=predictions$class)
##           predicted
## correct    low med_low med_high high
##   low       17       8        2    0
##   med_low    6      14        2    0
##   med_high   0      11       14    0
##   high       0       0        1   27

The tabulation shows that a majority of the classifications were correct (i.e., they lie on the diagonal). However, many observations were classified into a neighboring class. No observation was classified more than two steps away from its correct class.
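The overall accuracy is the sum of the diagonal over the total count; computing it from the table above (the test split is random, so these counts vary from run to run):

```r
# Cross-tabulation counts copied from the table above
tab <- matrix(c(17,  8,  2,  0,
                 6, 14,  2,  0,
                 0, 11, 14,  0,
                 0,  0,  1, 27),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 3)
# 0.706 for this particular split
```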

(7) K-means clustering

# Reload the original Boston dataset
data("Boston")

# Scale the variables so that all are normalized (zero mean and unit standard
# deviation).
boston_scaled <- as.data.frame(scale(Boston))

# Calculate distances between the (scaled) observations.  We use the
# Euclidean distance here (i.e., L2-norm).  Are we supposed to use these
# distances somehow?  The instructions do not say so.
dist_eu <- dist(boston_scaled, method="euclidean")
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# Run k-means on the dataset.
km <- kmeans(boston_scaled, centers=4)

# Visualize the clustering for a few variable pairs as a sample
pairs(boston_scaled[6:10], col=km$cluster)

# Investigate the optimal number of clusters.  The "optimal" value is said to
# be where the WCSS drops radically.  (In real applications with overlapping
# clusters it is often not quite this simple.)
k_max = 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# Plot the WCSS as a function of the number of clusters.
qplot(x=1:k_max, y=twcss, geom="line")

We can see from the plot that WCSS drops significantly between 1 and 2 clusters (though there are additional decreases with more clusters). Based on this, we pick 2 as the optimal number of clusters and run k-means again with that number. It would not be unreasonable to choose a higher number of clusters, but there are no additional dramatic drops at higher cluster counts.
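The same “elbow” can also be located numerically by looking at the relative drop in total WCSS at each step; a sketch (fixing the random seed so the result is reproducible):

```r
library(MASS)

data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k) {
  kmeans(boston_scaled, centers = k, nstart = 10)$tot.withinss
})

# Fractional decrease in WCSS when moving from k to k+1 clusters
drops <- -diff(twcss) / head(twcss, -1)
round(drops, 2)  # the first entry (1 -> 2 clusters) should dominate
```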

# Run k-means again with the "optimal" number of clusters
km <- kmeans(boston_scaled, centers=2)

# Visualize the clusters (the "optimal" number of them)
pairs(boston_scaled, col=km$cluster)

# Given that visualizing all pairs is near unreadable, visualize for a small
# sample of variables.  We could use select() to pick a list of named
# variables instead, but instructions don't require it.
pairs(boston_scaled[6:10], col=km$cluster)

Looking at the visualized variables, we can see that some pairs of variables separate the clusters pretty well. For example, tax, rad, and dis all seem to offer relatively good separation.

Bonus exercise: Clustering and visualization with arrows

# Reload the original Boston dataset
data("Boston")

# Scale the variables so that all are normalized (zero mean and unit standard
# deviation).
boston_scaled <- as.data.frame(scale(Boston))

# Cluster into 3 clusters using k-means
km <- kmeans(boston_scaled, centers=3)

# Perform LDA using the clusters as target classes
lda.fit <- lda(x=boston_scaled, grouping=km$cluster)

# print the lda.fit object
lda.fit
## Call:
## lda(boston_scaled, grouping = km$cluster)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.2786561 0.4130435 0.3083004 
## 
## Group means:
##         crim         zn      indus        chas        nox         rm
## 1 -0.3989734  1.1241495 -0.9749470 -0.04894749 -0.8314162  0.9328502
## 2 -0.3763256 -0.3947158 -0.1595373  0.01023793 -0.2992872 -0.2772450
## 3  0.8647905 -0.4872402  1.0949412  0.03052480  1.1524404 -0.4717159
##           age        dis        rad        tax     ptratio      black
## 1 -0.90687091  0.8988453 -0.5892747 -0.6843385 -0.77780359  0.3568316
## 2  0.02424663  0.0304962 -0.5856776 -0.5482746  0.09789185  0.2769803
## 3  0.78718751 -0.8532750  1.3172714  1.3530841  0.57186480 -0.6936034
##         lstat       medv
## 1 -0.92612124  0.9888052
## 2 -0.03824517 -0.1525634
## 3  0.88830984 -0.6893319
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim    -0.03223565  0.18654924
## zn      -0.04675240  0.87238302
## indus    0.40550042 -0.16163062
## chas     0.08979225 -0.11703465
## nox      1.05432570  0.68362018
## rm      -0.26923690  0.50640378
## age     -0.21593250 -0.45834423
## dis      0.06906243  0.34138098
## rad      1.15943286  0.38841914
## tax      0.99119613  0.47397797
## ptratio  0.07545578 -0.16725670
## black   -0.07369899 -0.03242015
## lstat    0.26131112  0.32118017
## medv    -0.03902737  0.67098685
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8724 0.1276
# Visualize the results using a biplot with arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col="black", pos=3)
}

# Plot the LDA results using two dimensions, with arrows
plot(lda.fit, dimen = 2, col=km$cluster, pch=km$cluster)
lda.arrows(lda.fit, myscale = 1)

# Plot again with much bigger scale so that we can read the labels on the
# smaller arrows
plot(lda.fit, dimen = 2, col=km$cluster, pch=km$cluster)
lda.arrows(lda.fit, myscale = 5)

# Plot again with even bigger scale so that we can read the labels on the
# smaller arrows
plot(lda.fit, dimen = 2, col=km$cluster, pch=km$cluster)
lda.arrows(lda.fit, myscale = 15)

The length of an arrow reflects the direction of that variable relative to the projection plane chosen for visualizing the clusters. The longer the arrow, the closer the variable’s direction is to the plane; the shorter, the more nearly perpendicular it is to the plane.

It looks like the plane that R chose for visualization (which seems to separate the clusters reasonably well) is most parallel to nox (nitrogen oxides concentration), second most parallel to rad (index of accessibility to radial highways), and third most parallel to tax (full-value property-tax rate).
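The ordering of the arrows can be checked directly from the discriminant coefficients printed above: each arrow's length is the Euclidean norm of that variable's (LD1, LD2) coefficient pair. A small sketch, using the coefficient values copied from the lda.fit output above:

```r
# Arrow length in the LD1-LD2 plane = Euclidean norm of each variable's
# coefficient pair (values taken from the lda.fit printout above).
heads <- rbind(
  crim    = c(-0.03223565,  0.18654924),
  zn      = c(-0.04675240,  0.87238302),
  indus   = c( 0.40550042, -0.16163062),
  nox     = c( 1.05432570,  0.68362018),
  rad     = c( 1.15943286,  0.38841914),
  tax     = c( 0.99119613,  0.47397797),
  medv    = c(-0.03902737,  0.67098685)
)
arrow_len <- sqrt(rowSums(heads^2))
# The three longest arrows, i.e. the variables most parallel to the plane
head(sort(arrow_len, decreasing = TRUE), 3)
```

With the full coefficient table this ranks nox, rad, and tax as the longest arrows, matching the reading of the plots.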

Super-Bonus: 3D plots

# Save gold standard classes from the training data
train_correct <- train$crime

# Recompute the LDA model from the training data
model_predictors <- dplyr::select(train, -crime)
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 14  2
model <- lda(crime ~ ., data=train)

# Map the datapoints into the visualization space
matrix_product <- as.matrix(model_predictors) %*% model$scaling
matrix_product <- as.data.frame(matrix_product)

# Create a 3D plot with color indicating the gold standard crime
# classes in the train set
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(matrix_product, x=~LD1, y=~LD2, z=~LD3,
        type="scatter3d", mode="markers",
        color=train_correct, marker=list(size=2))
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
# Create a 3D plot with color indicating the predicted crime classes.  Since
# the gold data used four classes, we'll predict four classes.
train_no_crime = dplyr::select(train, -crime)
km <- kmeans(train_no_crime, centers=4)
plot_ly(matrix_product, x=~LD1, y=~LD2, z=~LD3,
        type="scatter3d", mode="markers",
        color=km$cluster, marker=list(size=2))

Both plots contain the same training points, so their point locations in the visualization space are identical. We can see that the green, orange, and blue classes in the original data overlap quite a bit and are not well separated. The prediction, however, produces much sharper class boundaries. This is as expected, because k-means assigns each point to its nearest cluster center, so the boundaries between clusters are crisp by construction. While there are several dimensions not shown in these plots, one could expect those dimensions to partially reflect the distance from the cluster centers, which is consistent with what we see in the plot for the predicted data. Apart from the sharpening of class boundaries and the different color scheme, the plots are fairly similar.
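One way to quantify how well the k-means clusters line up with the gold-standard classes is a simple cross-tabulation. A sketch, assuming the train_correct and km objects created in the code above (note that k-means cluster labels are arbitrary, so agreement shows up as one dominant class per cluster column rather than a diagonal):

```r
# Cross-tabulate gold-standard crime classes against k-means clusters
# (assumes train_correct and km from the code chunks above)
table(gold = train_correct, cluster = km$cluster)
```

A cluster column with one large entry and small remainders indicates that the cluster mostly captures a single crime class; spread-out columns correspond to the overlapping green/orange/blue regions seen in the first 3D plot.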

Data wrangling

The data wrangling script that was part of Exercise 4 can be found in the data directory of my github repository, here.


Course diary

Week 1

Thoughts after the first week:

  • Created IODS-project repository
  • Installed R and Rstudio on Ubuntu 20.04 (Linux). Rstudio (latest 1.3.1093) keeps crashing very frequently. R itself seems to work. For now I’m using Rscript to generate HTML from the .Rmd files.
  • I’ve used github for years so that part was easy.
  • Rstudio did not clone the github repository as I would like. It did not store the username in the cloned repository correctly. Thus manual git pull or git push did not work properly with SSH keys (it asked for git username and password). I had to tweak it manually to make SSH keys work correctly.
  • The actual information in the lecture could have been presented much faster. I will probably read the material from the book/web pages/transcripts in the coming weeks rather than listening to lectures.

Week 2

Thoughts for the second week:

  • I went through the DataCamp exercises. The platform was simple and easy to use, even though this would not have been my preferred mode of learning. I would rather have read a description of the R philosophy and how the main commands work, and then approached it as a programming language. As it is, rather than learning the basics, we are forced to memorize snippets that may be useful in themselves, without coming to understand what the options and commands really mean or what the underlying philosophy and concepts are.
  • It took some time to find out how to use R. I’m not a fan of its programming language syntax, but it certainly seems useful for various statistical and plotting tasks once you learn the different libraries. The challenge is that you need to become familiar with a number of libraries. In some sense python, pandas, scikit, and matplotlib still feel easier to me.
  • I’ve used various types of linear regression many times in the past (least squares, Lasso, Ridge), but haven’t really looked at significance analysis or the distribution of residual errors before. I think that analysis may turn out to be a useful tool for me in the future, though I would probably implement it in Python for my machine learning applications rather than use R.
  • Rmarkdown is kind of a cool idea and nice for prototyping or coursework, but I still fail to see how to utilize it for an academic paper. The nice thing is that it leaves a trail that makes repeating the procedure easy.

Week 3

Thoughts after the third week

  • I’m becoming frustrated. I read through the chapters of the book, but found them too general and vague, lacking precise analysis and description of the topics. Perhaps I’d prefer a more mathematical approach to the topics.
  • I’m finding I really dislike R syntax and the study approach taken in the DataCamp exercises. They are not hard, but they involve performing tasks without first gaining an understanding of the available commands and operations. We haven’t looked at even the basic concepts of the programming language that R is, yet we are learning and memorizing snippets in the hope that they might be useful for something. As someone with a long programming background, I find this approach very frustrating, inefficient, and annoying. I’m waiting for a book on R programming to arrive.
  • I do appreciate the graphics and significance testing that are easily available with R packages.
  • The Super-Bonus exercise was fun.

Week 4

Thoughts after the fourth week

  • These exercises are taking quite a few hours to do. They are not difficult, but there is a lot of tedious detail. This works to teach some rote skills, but the concepts and overall approach of R have not been discussed. R really looks like a hack, though a lot of people have clearly put a lot of effort into it.
  • I got a book on R, which helps somewhat, though it doesn’t go as deep as I’d like either: Nicholas J. Horton and Ken Kleinman: Using R and RStudio for Data Management, Statistical Analysis, and Graphics, 2nd ed., CRC Press, 2015.

(more chapters to be added similarly as we proceed with the course!)